Statistical Analysis of HDD Failure

Matthew Unrue, M.S. Data Analytics

The Problem

Hard disk drives underpin all of our data storage, but like all computing hardware, their failure is not a question of if, but of when.

The Problem

When failure occurs, the data stored on these drives is lost. Recovery may be an option, but it can cost up to $7,500 per drive.

The Problem

Sherweb Cloud Solutions found that “While 57% of IT managers have a backup solution in place, 75% of them were not able to restore all of their lost data. In fact, 23% of people with a backup solution in place weren’t able to recover any data at all” (Painchaud, 2018).

The Problem

Keeping a constant backup of every drive could double costs and still not guarantee complete data safety.

The Reality

Our business's data, and the operations built on it, are simply too important to leave to chance. We need to do more than back up data for when a main drive stops working. We need a solution that can determine when a drive is likely to fail, so that we can back up its data and replace it before any loss occurs.

The Hypothesis

The study factors significantly indicate impending hard disk drive failure and can be used to predict failure on the day it occurs.

The Hypothesis

The Study Factors

  • Drive information like manufacturer and capacity
  • Drive SMART attribute values
  • Drive SMART category values

The Hypothesis

Self-Monitoring, Analysis and Reporting Technology (SMART)

SMART reports valuable attribute indicators of drive performance and diagnostic information that reflects the health of a drive in dozens of ways.

The Data Used

  • 2019 4th-quarter data provided by the American cloud storage provider Backblaze (a loading sketch follows this list)
  • Totaled 10,991,209 hard drive operation days
  • 131 columns of data to leverage for results
  • Only 678 failures
  • A failure rate of 0.006% per drive-day
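
As a minimal sketch, the quarterly data can be loaded as follows, assuming Backblaze's daily snapshot CSVs have been extracted into a local data_Q4_2019 folder and that the failure flag column is named failure (the folder name is illustrative):

    # Load the Q4 2019 snapshots: one CSV per day, one row per operational drive.
    from pathlib import Path

    import pandas as pd

    frames = [pd.read_csv(path) for path in sorted(Path("data_Q4_2019").glob("*.csv"))]
    df = pd.concat(frames, ignore_index=True)

    print(df.shape)             # roughly (10_991_209, 131)
    print(df["failure"].sum())  # 678 recorded failures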

The Data Preparation

Every analysis starts by checking and cleaning data.

  • 63 redundant attributes removed
  • A few thousand rows removed because of errors or blank data
  • Millions of missing values filled with means and medians (a sketch of this fill step follows this list)
  • 13 new attributes created to replace others, better representing the existing information despite the missing values
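
A minimal sketch of the mean/median fill described above, assuming df is the combined dataframe from the loading sketch; the skewness cutoff is an illustrative assumption, not the study's exact rule:

    # Fill missing SMART values: medians for heavily skewed columns,
    # means otherwise. The skewness cutoff of 1 is illustrative.
    smart_cols = [c for c in df.columns if c.startswith("smart_")]
    for col in smart_cols:
        if df[col].isna().any():
            skewed = abs(df[col].skew()) > 1
            df[col] = df[col].fillna(df[col].median() if skewed else df[col].mean())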

The Data Preparation

[Figure: Dataframe attribute distributions]

The Data Preparation

Dataset Splitting

Before analysis, the data was carefully split into 3 sets.

  • 70% For training models on
  • 20% For testing the models
  • 10% For a final prediction of solution success

Keeping the three sets completely separate gives confidence that the solutions will work on data they have never encountered before.
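
A sketch of the 70/20/10 split, assuming df holds the cleaned data and keeping only numeric predictors; the 10% hold-out is carved off first, then the remainder is split 7:2:

    # 70/20/10 split, stratified on the rare failure label.
    from sklearn.model_selection import train_test_split

    X = df.drop(columns=["failure"]).select_dtypes(include="number")
    y = df["failure"]

    # Carve off the 10% final hold-out first.
    X_work, X_holdout, y_work, y_holdout = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=42
    )
    # 2/9 of the remaining 90% equals 20% of the original data.
    X_train, X_test, y_train, y_test = train_test_split(
        X_work, y_work, test_size=2 / 9, stratify=y_work, random_state=42
    )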

The Data Preparation

Working with Extremely Rare Events

The dataset started with nearly 11 million instances yet contained only 678 failures. A drive failing on any given day of the 4th quarter was a 1-in-16,210 chance, yet 7.4 drives failed on an average day. Events this rare require specialized techniques to model.

The Data Preparation

Working with Extremely Rare Events

The Synthetic Minority Over-Sampling Technique (SMOTE) was used to balance the training dataset.

  • SMOTE created synthetic instances of drive failure until failures and non-failures were equal in number (a sketch follows this list).
  • Each synthetic instance is interpolated between real failure instances, so it lies mathematically closer to actual failures than to non-failures.
  • Without this step, the models would maximize their correct predictions simply by always guessing non-failure, rendering them useless.
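
A minimal sketch of the balancing step using the imbalanced-learn library, assuming the X_train and y_train sets from the splitting sketch:

    # Balance only the training set; the test and hold-out sets keep the
    # true failure rate so evaluation stays honest.
    from imblearn.over_sampling import SMOTE

    X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

    print(y_train.value_counts())      # heavily imbalanced
    print(y_train_bal.value_counts())  # equal failure / non-failure counts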

The Data Analysis

The analysis formally begins by calculating the correlation of each attribute with every other attribute.

[Figure: Correlation heatmap of the attributes]
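
A minimal sketch of this correlation step, assuming df holds the cleaned numeric data:

    # Pairwise correlations across the numeric attributes, drawn as a heatmap.
    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.corr(numeric_only=True)
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, cmap="coolwarm", center=0)
    plt.title("Attribute Correlation Heatmap")
    plt.tight_layout()
    plt.show()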

The Data Analysis

Dimensionality Reduction

  • A technique called Principal Component Analysis (PCA) was used to reorient and transform the numerical data (a sketch follows this list).
  • This allowed 35% of the data's dimensions to be removed while losing only about 18% of the information contained in the data.
  • This speeds up building and testing new solutions with the least impact mathematically possible.
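
A sketch of the PCA step, assuming the balanced training data from the SMOTE sketch; the 82% variance target mirrors the roughly 18% information loss noted above:

    # Standardize, then keep enough principal components to retain ~82%
    # of the variance (about 18% information loss, as noted above).
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(X_train_bal)
    pca = PCA(n_components=0.82).fit(scaler.transform(X_train_bal))

    X_train_pca = pca.transform(scaler.transform(X_train_bal))
    # Reuse the fitted transforms on the test set; never refit on it.
    X_test_pca = pca.transform(scaler.transform(X_test))
    print(pca.n_components_, pca.explained_variance_ratio_.sum())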

The Data Analysis

Finally, these results were used to create 6 different models that leverage machine learning and artificial intelligence techniques to predict drive failure.
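
As a sketch, two of the six models might be fit like this (hyperparameters are illustrative, not the study's exact settings):

    # Two of the six models: plain logistic regression and a
    # class-weighted random forest, trained on the PCA-reduced data.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    log_reg = LogisticRegression(max_iter=1000).fit(X_train_pca, y_train_bal)
    weighted_rf = RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=42
    ).fit(X_train_pca, y_train_bal)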

The Findings

The attributes most closely related to drive failure are:

  • SMART Attribute 5: Reallocated Sectors Count
  • SMART Attribute 9: Power-On Hours
  • SMART Attribute 197: Current Pending Sector Count

The Findings

The Success Statistics of the 6 Models

Model                         Sensitivity  Specificity  Precision   Error Rate  ROC AUC
Logistic Regression           0.6397       0.9732       1.1478e-3   2.68%       0.8729
Decision Tree                 0.4412       0.9690       0.8829e-3   3.10%       0.6900
Random Forest                 0.3603       0.9903       2.2900e-3   0.98%       0.7974
Class-Weighted Random Forest  0.4044       0.9717       0.8858e-3   2.83%       0.7998
Simple DNN                    0.6176       0.9185       0.4696e-3   8.15%       0.7681
Complex DNN                   0.7132       0.9364       0.6946e-3   6.36%       0.8248

Across every success metric, the models built on the study factors perform significantly better than chance (a sketch of how these metrics are derived follows below).
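
A sketch of how the table's metrics can be derived from a confusion matrix, shown here for the logistic regression model from the earlier sketch:

    # Derive the table's metrics from a confusion matrix on the test set.
    from sklearn.metrics import confusion_matrix, roc_auc_score

    y_pred = log_reg.predict(X_test_pca)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    sensitivity = tp / (tp + fn)                  # real failures caught
    specificity = tn / (tn + fp)                  # healthy drives cleared
    precision = tp / (tp + fp)                    # flags that were real
    error_rate = (fp + fn) / (tn + fp + fn + tp)  # overall misclassification
    roc_auc = roc_auc_score(y_test, log_reg.predict_proba(X_test_pca)[:, 1])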

The Limitations

  • The dataset was missing many values.
  • SMOTE was absolutely necessary here given the extreme rarity of failure instances, but the process does introduce some bias into the results.
  • Due to computing memory and project scope limitations, certain analyses could not be performed.
  • The success of these statistical models is only assured on drives similar or identical to the drive models in the project dataset. Thankfully, the dataset covered an extremely wide range of drive types and models.

The Limitations

  • Finally, although drive manufacturer was a useful value for making predictions, this dataset is not balanced enough to draw accurate conclusions about manufacturer reliability and performance.

[Figure: Distribution of drives by manufacturer]

These proportions of drive manufacturers would need to be nearly equal to judge their performance on that basis alone.

Actions Proposed and Expected Benefits

Integrate either the logistic regression model or the complex DNN model into the daily drive-diagnostics checks and the backup procedure pipeline.

  • The complex DNN will successfully flag 71.3% of the drives expected to fail each day, but carries a 6.36% false positive rate.
  • The logistic regression will successfully flag 64% of the drives expected to fail each day, with a more conservative 2.68% false positive rate.
  • Implementing either solution will allow drives to be fully backed up and retired before they fail, saving time, money, and effort every day.

Actions Proposed and Expected Benefits

Until this solution is in place, special care should be taken with drives whose SMART values are elevated in the 3 main study factors (an interim triage sketch follows the list):

  • SMART Attribute 5: Reallocated Sectors Count
  • SMART Attribute 9: Power-On Hours
  • SMART Attribute 197: Current Pending Sector Count
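
A minimal interim triage sketch, assuming Backblaze-style raw SMART column names; the thresholds below are illustrative assumptions, not outputs of this study:

    # Flag drives whose key raw SMART values look elevated; every cutoff
    # here is an illustrative assumption, not a result of the study.
    at_risk = df[
        (df["smart_5_raw"] > 0)          # any reallocated sectors
        | (df["smart_197_raw"] > 0)      # any pending sectors
        | (df["smart_9_raw"] > 40_000)   # very high power-on hours
    ]
    print(at_risk["serial_number"].nunique(), "drives to prioritize for backup")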

Actions Proposed and Expected Benefits

After the model solution is in place, additional research beyond the limits of this study is warranted.

  • Building a recurrent neural network (RNN), which can model each drive's history over time, is likely to improve on the solutions presented today.
  • The solutions presented today can also be combined into an ensemble, where they work together and cover each other's weaknesses (a sketch follows this list).
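
As a sketch of that future ensemble idea, the scikit-learn models from the earlier sketches could be combined with soft voting:

    # Soft-voting ensemble over the earlier scikit-learn models: averages
    # their predicted probabilities so each covers the others' weaknesses.
    from sklearn.ensemble import VotingClassifier

    ensemble = VotingClassifier(
        estimators=[("log_reg", log_reg), ("weighted_rf", weighted_rf)],
        voting="soft",
    ).fit(X_train_pca, y_train_bal)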

References

Backblaze. (2020). data_Q4_2019 [Data set]. San Mateo, CA: Backblaze.

Painchaud, A. (2018, October 31). 8 reasons on how data loss can negatively impact your business. Sherweb. https://www.sherweb.com/blog/security/statistics-on-data-loss/